Friedman Test (Nonparametric Repeated-Measures ANOVA)#

The Friedman test answers a very specific question:

When I measure the same blocks/subjects under k ≥ 2 conditions (treatments, models, UI variants, …), do the conditions differ systematically, without assuming normality?

It’s the rank-based analogue of repeated-measures ANOVA.


Learning goals#

By the end you should be able to:

  • decide when Friedman is the right test (and when it isn’t)

  • map your data into the required (n_blocks × k_treatments) matrix

  • compute the Friedman statistic step-by-step from within-block ranks

  • interpret the p-value and report an effect size (Kendall’s W)

  • run a NumPy-only Monte Carlo / permutation view of the null distribution

Prerequisites#

  • Hypothesis testing basics (null, p-value)

  • NumPy arrays

  • Plotly for visualization (this notebook uses plotly_white)

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(7)

# Optional: SciPy cross-checks (the core implementation below is NumPy-only)
try:
    from scipy import stats
except Exception:
    stats = None

1) When to use the Friedman test#

Use Friedman when you have:

  • Paired / repeated measurements: each block has one observation for every treatment.

    • blocks: people, datasets, days, machines, …

    • treatments: algorithms, drugs, UI variants, …

  • k ≥ 2 treatments and n ≥ 2 blocks

  • a measurement scale that is at least ordinal (so ranking makes sense)

  • you don’t want to assume normality (and you want a robust omnibus test)

Data layout#

Put your data in a matrix X of shape (n_blocks, k_treatments):

  • row i = block i (one paired set)

  • column j = treatment j

Hypotheses#

  • H0: all treatments have the same distribution (no systematic treatment effect)

  • H1: at least one treatment differs

The test is omnibus: if you reject H0, you have learned that “not all treatments are equivalent”, but not which treatments differ.

Key assumptions (often overlooked)#

  • Blocks are independent of each other.

  • Within a block, treatment labels are comparable (same scale/units).

  • No missing values in the standard formulation.

2) Intuition: ranks within each block#

For each block (row), replace raw values by ranks 1..k.

  • If higher is better (e.g., accuracy), the largest value gets rank 1.

  • If lower is better (e.g., error), the smallest value gets rank 1.

Under H0, each treatment’s ranks are spread roughly uniformly across blocks, so the rank sums per treatment should be similar.

If one treatment is systematically better, it receives small ranks more often → its rank sum becomes noticeably smaller than the others.
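To make this concrete, here is a tiny hand-made example (three blocks, three treatments, no ties) in which the same treatment wins in every block; note how unequal the rank sums become:

```python
import numpy as np

# Toy data: 3 blocks (rows) x 3 treatments (columns), higher is better.
# The third treatment wins in every block.
X = np.array([
    [0.70, 0.75, 0.80],
    [0.60, 0.66, 0.71],
    [0.72, 0.74, 0.79],
])

# Within-block ranks with rank 1 = best (no ties, so a double argsort works).
ranks = X.shape[1] - np.argsort(np.argsort(X, axis=1), axis=1)
print(ranks)              # every row is [3, 2, 1]
print(ranks.sum(axis=0))  # rank sums [9, 6, 3]: maximally spread out
```

Under H0 each column's rank sum would hover around n(k+1)/2 = 6; here they are as far apart as the design allows.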

3) The statistic (what is actually computed)#

Let:

  • n = number of blocks

  • k = number of treatments

  • r_ij = rank of treatment j within block i (rank 1 = best)

  • R_j = \u2211_i r_ij = sum of ranks for treatment j

The Friedman statistic is:

\[ Q = \frac{12}{n k (k+1)} \sum_{j=1}^k R_j^2 - 3n(k+1). \]
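As a quick sanity check on the formula: plugging in the maximally spread rank sums R = (9, 6, 3) from a 3×3 example in which every block ranks the treatments identically gives the largest possible value, Q = n(k-1):

```python
import numpy as np

# Perfectly consistent rankings over n=3 blocks and k=3 treatments
# produce rank sums R = (9, 6, 3).
n, k = 3, 3
R = np.array([9.0, 6.0, 3.0])

Q = 12.0 / (n * k * (k + 1)) * np.sum(R**2) - 3.0 * n * (k + 1)
print(Q)  # -> 6.0, which equals n*(k-1), the maximum possible Q
```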

Tie correction (important in discrete data)#

If there are ties within blocks, ranks are averaged (e.g. two values tied for best share rank (1+2)/2 = 1.5). A standard tie correction is:

\[ C = 1 - \frac{\sum_{i=1}^n \sum_{g \in \text{ties in block } i} (t_g^3 - t_g)}{n k (k^2 - 1)}, \qquad Q_{\text{corr}} = \frac{Q}{C}. \]

Here, each tie group g has size t_g.
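For example, with n = 2 blocks and k = 4 treatments, a single tied pair (one group of size t = 2) in one block shrinks the correction factor only slightly:

```python
# One tie group of size t=2 in the first block; the second block has no ties.
n, k = 2, 4
tie_sizes = [[2], []]  # tie-group sizes per block

tie_sum = sum(t**3 - t for block in tie_sizes for t in block)  # 2**3 - 2 = 6
C = 1.0 - tie_sum / (n * k * (k**2 - 1))
print(C)  # -> 0.95, so Q_corr = Q / 0.95 is modestly inflated
```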

Interpretation#

  • Large Q means rank sums are more spread out than expected under H0 → evidence that treatments differ.

  • The p-value is an upper-tail probability: p = P(Q ≥ Q_obs | H0).

Effect size: Kendall’s W#

A common effect size is Kendall’s W (coefficient of concordance):

\[ W = \frac{Q}{n (k-1)} \in [0,1]. \]
  • W ≈ 0 → little/no systematic ranking difference across treatments

  • W ≈ 1 → very strong, consistent ordering across blocks

(With ties, W is often computed from the tie-corrected Q.)
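For instance, the largest attainable Q for n = 3 blocks and k = 3 treatments is n(k-1) = 6, which maps to W = 1 (perfect agreement across blocks):

```python
n, k = 3, 3
Q = 6.0  # maximum possible Q for these dimensions (perfectly consistent ranks)
W = Q / (n * (k - 1))
print(W)  # -> 1.0
```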

def rankdata_average_ties_1d(x: np.ndarray, *, descending: bool = False) -> np.ndarray:
    """Rank a 1D array with average ranks for ties.

    Returns ranks in {1,...,len(x)} as float.
    If descending=True, larger values get smaller (better) ranks.
    """
    x = np.asarray(x)
    if x.ndim != 1:
        raise ValueError("x must be 1D")
    if x.size == 0:
        raise ValueError("x must be non-empty")
    if not np.all(np.isfinite(x)):
        raise ValueError("x contains non-finite values")

    x_work = -x if descending else x
    order = np.argsort(x_work, kind="mergesort")
    x_sorted = x_work[order]

    ranks_sorted = np.empty_like(x_sorted, dtype=float)
    n = x_sorted.size

    i = 0
    while i < n:
        j = i + 1
        while j < n and x_sorted[j] == x_sorted[i]:
            j += 1

        # Items i..(j-1) are tied and would have ranks (i+1)..j.
        rank_avg = (i + 1 + j) / 2.0
        ranks_sorted[i:j] = rank_avg
        i = j

    ranks = np.empty_like(ranks_sorted)
    ranks[order] = ranks_sorted
    return ranks


def rank_rows_average_ties(X: np.ndarray, *, descending: bool = False) -> np.ndarray:
    """Rank each row of X independently (average ranks for ties)."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError("X must be 2D with shape (n_blocks, k_treatments)")
    if not np.all(np.isfinite(X)):
        raise ValueError("X contains non-finite values")

    return np.vstack(
        [rankdata_average_ties_1d(row, descending=descending) for row in X]
    )


def friedman_statistic_from_ranks(
    ranks: np.ndarray, *, tie_correction: bool = True
) -> dict:
    """Compute Friedman Q (and Kendall's W) from a rank matrix.

    ranks has shape (n_blocks, k_treatments) and contains within-block ranks.
    """
    ranks = np.asarray(ranks, dtype=float)
    if ranks.ndim != 2:
        raise ValueError("ranks must be 2D")

    n, k = ranks.shape
    if n < 2:
        raise ValueError("Need at least n>=2 blocks")
    if k < 2:
        raise ValueError("Need at least k>=2 treatments")

    rank_sums = ranks.sum(axis=0)

    Q = (
        12.0 / (n * k * (k + 1.0)) * np.sum(rank_sums**2)
        - 3.0 * n * (k + 1.0)
    )

    correction = 1.0
    if tie_correction:
        tie_sum = 0.0
        for i in range(n):
            _, counts = np.unique(ranks[i], return_counts=True)
            counts = counts[counts > 1]
            if counts.size:
                tie_sum += np.sum(counts**3 - counts)

        correction = 1.0 - tie_sum / (n * k * (k**2 - 1.0))
        if correction <= 0:
            raise ValueError("Non-positive tie correction factor; check ranks")
        Q = Q / correction

    W = Q / (n * (k - 1.0))

    expected_rank_sum = n * (k + 1.0) / 2.0
    return {
        "n": int(n),
        "k": int(k),
        "Q": float(Q),
        "W": float(W),
        "rank_sums": rank_sums,
        "expected_rank_sum": float(expected_rank_sum),
        "tie_correction_factor": float(correction),
    }


def friedman_Q_from_rank_sums(rank_sums: np.ndarray, *, n: int, k: int) -> np.ndarray:
    """Compute Friedman Q from rank sums R_j (vectorized)."""
    rank_sums = np.asarray(rank_sums, dtype=float)
    return 12.0 / (n * k * (k + 1.0)) * np.sum(rank_sums**2, axis=-1) - 3.0 * n * (
        k + 1.0
    )


def friedman_null_Q_from_ranks(
    ranks: np.ndarray,
    *,
    tie_correction_factor: float = 1.0,
    n_resamples: int = 20000,
    seed: int = 0,
) -> np.ndarray:
    """Permutation null distribution of Q by shuffling ranks within each block.

    This is equivalent to permuting treatment labels within each block under H0.
    It also preserves tie patterns (because the multiset of ranks per block is fixed).
    """
    ranks = np.asarray(ranks, dtype=float)
    if ranks.ndim != 2:
        raise ValueError("ranks must be 2D")

    n, k = ranks.shape
    if n < 2 or k < 2:
        raise ValueError("Need n>=2 and k>=2")
    if n_resamples < 1:
        raise ValueError("Need n_resamples>=1")
    if tie_correction_factor <= 0:
        raise ValueError("tie_correction_factor must be positive")

    rng = np.random.default_rng(seed)

    # Random permutations per (resample, block)
    u = rng.random((n_resamples, n, k))
    perm = np.argsort(u, axis=2)

    ranks_perm = np.take_along_axis(
        np.broadcast_to(ranks, (n_resamples, n, k)), perm, axis=2
    )
    rank_sums = ranks_perm.sum(axis=1)
    Q = friedman_Q_from_rank_sums(rank_sums, n=n, k=k)
    return Q / tie_correction_factor


def friedman_test_numpy(
    X: np.ndarray,
    *,
    higher_is_better: bool = True,
    tie_correction: bool = True,
    n_resamples: int = 20000,
    seed: int = 0,
) -> dict:
    """Friedman test computed from scratch (NumPy-only), plus a Monte Carlo p-value."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError("X must be 2D with shape (n_blocks, k_treatments)")
    if not np.all(np.isfinite(X)):
        raise ValueError("X contains non-finite values")

    ranks = rank_rows_average_ties(X, descending=higher_is_better)
    core = friedman_statistic_from_ranks(ranks, tie_correction=tie_correction)

    Q_null = friedman_null_Q_from_ranks(
        ranks,
        tie_correction_factor=core["tie_correction_factor"],
        n_resamples=n_resamples,
        seed=seed,
    )

    # Upper-tail p-value (add-one smoothing avoids returning exactly 0.0).
    p_value = (1.0 + np.sum(Q_null >= core["Q"])) / (n_resamples + 1.0)

    return {
        **core,
        "ranks": ranks,
        "Q_null": Q_null,
        "p_value_mc": float(p_value),
        "higher_is_better": bool(higher_is_better),
    }
# Tiny example with ties (two equal best values)
toy = np.array([[10.0, 10.0, 7.0, 3.0]])
ranks_toy = rank_rows_average_ties(toy, descending=True)

pd.DataFrame(
    {
        "value": toy[0],
        "rank (1=best)": ranks_toy[0],
    },
    index=["A", "B", "C", "D"],
)
value rank (1=best)
A 10.0 1.5
B 10.0 1.5
C 7.0 3.0
D 3.0 4.0

4) Worked example (algorithms across datasets)#

A classic use case is comparing several ML algorithms across multiple datasets (each dataset is a block).

We’ll simulate k=4 algorithms evaluated on n=24 datasets with paired accuracies (higher is better).

n_blocks = 24
algorithms = np.array(["Algo A", "Algo B", "Algo C", "Algo D"])
k = algorithms.size

# Dataset difficulty / baseline accuracy
baseline = rng.uniform(0.65, 0.80, size=n_blocks)

# Systematic treatment effects (A < B < C < D in accuracy)
effects = np.array([0.00, 0.02, 0.04, 0.06])

noise = rng.normal(0, 0.02, size=(n_blocks, k))
scores = np.clip(baseline[:, None] + effects[None, :] + noise, 0.0, 1.0)

df = pd.DataFrame(scores, columns=algorithms)
df.insert(0, "block", np.arange(1, n_blocks + 1))
df.head()
block Algo A Algo B Algo C Algo D
0 1 0.746899 0.760026 0.733429 0.792990
1 2 0.783612 0.806848 0.793979 0.835027
2 3 0.746782 0.770176 0.827571 0.810202
3 4 0.683131 0.721469 0.712109 0.741547
4 5 0.697234 0.716301 0.710524 0.756548
# Paired nature: each line is one block (dataset)
fig = go.Figure()
for i in range(n_blocks):
    fig.add_trace(
        go.Scatter(
            x=algorithms,
            y=scores[i],
            mode="lines+markers",
            line=dict(width=1),
            opacity=0.55,
            showlegend=False,
            hovertemplate=f"block={i+1}<br>%{{x}}=%{{y:.3f}}<extra></extra>",
        )
    )

fig.update_layout(
    title="Paired accuracies (each line = one block)",
    xaxis_title="Algorithm",
    yaxis_title="Accuracy",
)
fig.show()
# Distribution per algorithm (still paired, but shows marginal spread)
df_long = df.melt(id_vars="block", var_name="algorithm", value_name="accuracy")
fig = px.box(
    df_long,
    x="algorithm",
    y="accuracy",
    points="all",
    title="Accuracy by algorithm (paired blocks)",
)
fig.update_layout(xaxis_title="Algorithm", yaxis_title="Accuracy")
fig.show()
result = friedman_test_numpy(
    scores,
    higher_is_better=True,
    tie_correction=True,
    n_resamples=20000,
    seed=123,
)

result["Q"], result["p_value_mc"], result["W"], result["tie_correction_factor"]
(49.950000000000045, 4.999750012499375e-05, 0.6937500000000006, 1.0)
# Summarize rank sums / mean ranks (lower is better)
summary = pd.DataFrame(
    {
        "algorithm": algorithms,
        "rank_sum": result["rank_sums"],
        "mean_rank": result["rank_sums"] / result["n"],
        "expected_rank_sum": result["expected_rank_sum"],
    }
).sort_values("mean_rank")
summary
algorithm rank_sum mean_rank expected_rank_sum
3 Algo D 32.0 1.333333 60.0
2 Algo C 50.0 2.083333 60.0
1 Algo B 65.0 2.708333 60.0
0 Algo A 93.0 3.875000 60.0
# Heatmap of within-block ranks (1=best)
fig = px.imshow(
    result["ranks"],
    x=algorithms,
    y=[f"Block {i}" for i in range(1, n_blocks + 1)],
    aspect="auto",
    color_continuous_scale="Viridis_r",
    title="Within-block ranks (1=best)",
)
fig.update_layout(xaxis_title="Algorithm", yaxis_title="Block")
fig.show()
# Mean rank plot (a common way to report Friedman results)
fig = px.bar(
    summary,
    x="algorithm",
    y="mean_rank",
    title="Mean rank per algorithm (lower is better)",
)
fig.update_layout(xaxis_title="Algorithm", yaxis_title="Mean rank")
fig.show()
# Null distribution of Q (Monte Carlo) + observed statistic
Q_null = result["Q_null"]
Q_obs = result["Q"]
p_mc = result["p_value_mc"]

fig = px.histogram(
    Q_null,
    nbins=60,
    title="Friedman Q under H0 (permutation of within-block ranks)",
)
fig.add_vline(
    x=Q_obs,
    line_color="crimson",
    line_width=3,
    annotation_text=f"Observed Q={Q_obs:.2f}<br>p≈{p_mc:.4f}",
    annotation_position="top right",
)
fig.update_layout(xaxis_title="Q", yaxis_title="count")
fig.show()

5) How to interpret the result (what it means)#

If p is small (e.g. < 0.05)#

  • You reject H0: it’s unlikely that all treatments are equivalent.

  • Concretely: the observed spread of rank sums is too large to plausibly come from random rank assignment.

  • You still don’t know which treatments differ → you need post-hoc comparisons.

If p is not small#

  • You do not reject H0: you don’t have evidence of systematic differences.

  • This is not proof of equality; you might be underpowered (small n), or differences may be tiny.

Reporting tips#

  • Report Q, df=k-1, p, and an effect size like W.

  • Also report mean ranks (often more interpretable than raw Q).

# Optional: compare against SciPy's friedmanchisquare (asymptotic chi-square p-value)
if stats is None:
    print("SciPy not available; skipping cross-check.")
else:
    Q_scipy, p_scipy = stats.friedmanchisquare(*[scores[:, j] for j in range(k)])
    print(f"SciPy: Q={Q_scipy:.6f}, p={p_scipy:.6g}")
    print(f"Ours : Q={result['Q']:.6f}, p_mc≈{result['p_value_mc']:.6g}")

    # Iman-Davenport correction (often used in ML-algorithm comparison)
    Q = result["Q"]
    F = (n_blocks - 1.0) * Q / (n_blocks * (k - 1.0) - Q)
    p_F = stats.f.sf(F, k - 1, (k - 1) * (n_blocks - 1))
    print(f"Iman-Davenport: F={F:.6f}, p={p_F:.6g}")
SciPy: Q=49.950000, p=8.18748e-11
Ours : Q=49.950000, p_mc≈4.99975e-05
Iman-Davenport: F=52.102041, p=1.04458e-17

6) Practical notes, pitfalls, and next steps#

  • Direction matters: decide whether higher or lower values should get rank 1.

  • Don’t ignore pairing: if blocks are not the same units across treatments, Friedman is the wrong tool.

  • Significant Friedman does not localize differences.

    • Common post-hoc options: Nemenyi (all-pairs on mean ranks) or pairwise Wilcoxon signed-rank with multiplicity correction.

  • Ties are common in discrete scores (e.g., integer ratings). Use average ranks and a tie correction.

If your data are independent groups (not repeated measures), look at Kruskal–Wallis instead.
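As a minimal sketch of the post-hoc step mentioned above, here is all-pairs Wilcoxon signed-rank testing with a hand-rolled Holm step-down correction. It assumes SciPy is available; `pairwise_wilcoxon_holm` is a helper invented here for illustration, not a library function:

```python
from itertools import combinations

import numpy as np
from scipy import stats


def pairwise_wilcoxon_holm(X, names):
    """All-pairs Wilcoxon signed-rank tests with Holm-adjusted p-values.

    X has shape (n_blocks, k_treatments); returns (label, p_raw, p_holm) tuples.
    Assumes no block has exactly equal scores within a compared pair.
    """
    pairs = list(combinations(range(X.shape[1]), 2))
    p_raw = np.array([stats.wilcoxon(X[:, a], X[:, b]).pvalue for a, b in pairs])

    # Holm step-down: multiply sorted p-values by m, m-1, ..., 1,
    # enforce monotonicity with a running maximum, and cap at 1.
    m = len(pairs)
    order = np.argsort(p_raw)
    adj_sorted = np.minimum(
        np.maximum.accumulate(p_raw[order] * (m - np.arange(m))), 1.0
    )
    p_holm = np.empty(m)
    p_holm[order] = adj_sorted

    return [
        (f"{names[a]} vs {names[b]}", p_raw[i], p_holm[i])
        for i, (a, b) in enumerate(pairs)
    ]


# Illustrative data: the third "treatment" is clearly shifted upward.
rng = np.random.default_rng(0)
demo = rng.normal(size=(20, 3))
demo[:, 2] += 1.0
for label, p, p_adj in pairwise_wilcoxon_holm(demo, ["A", "B", "C"]):
    print(f"{label}: p={p:.4f}, Holm-adjusted p={p_adj:.4f}")
```

Run the post-hoc step only after a significant Friedman result, and report the adjusted p-values.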

# Sanity check: when there is no systematic treatment effect, p-values should look non-significant on average.

n_blocks_0 = 24
k_0 = 4

baseline0 = rng.uniform(0.65, 0.80, size=n_blocks_0)
effects0 = np.zeros(k_0)
scores0 = np.clip(
    baseline0[:, None] + effects0[None, :] + rng.normal(0, 0.02, size=(n_blocks_0, k_0)),
    0.0,
    1.0,
)

res0 = friedman_test_numpy(scores0, higher_is_better=True, n_resamples=20000, seed=999)
res0["Q"], res0["p_value_mc"], res0["W"]
(0.8500000000000227, 0.8501574921253937, 0.011805555555555871)

Exercises#

  1. Change effects to make the algorithms closer together. How do Q, p, and W change?

  2. Increase/decrease n_blocks and see how the test\u2019s sensitivity changes.

  3. Create a dataset with deliberate ties (rounded scores) and see how the tie correction factor behaves.

References#

  • Friedman (1937): The use of ranks to avoid the assumption of normality implicit in the analysis of variance

  • Demšar (2006): Statistical Comparisons of Classifiers over Multiple Data Sets

  • scipy.stats.friedmanchisquare